Online Job Scheduling with Redundancy and Opportunistic Checkpointing: A Speedup-Function-Based Analysis

Authors

  • Huanle Xu
  • Gustavo de Veciana
  • Wing Cheong Lau
  • Kunxiao Zhou
Abstract

In a large-scale computing cluster, the job completions can be substantially delayed due to two sources of variability, namely, variability in the job size and that in the machine service capacity. To tackle this issue, existing works have proposed various scheduling algorithms which exploit redundancy wherein a job runs on multiple servers until the first completes. In this paper, we explore the impact of variability in the machine service capacity and adopt a rigorous analytical approach to design scheduling algorithms using redundancy and checkpointing. We design several online scheduling algorithms which can dynamically vary the number of redundant copies for jobs. We also provide new theoretical performance bounds for these algorithms in terms of the overall job flowtime by introducing the notion of a speedup function, based on which a novel potential function can be defined to enable the corresponding competitive ratio analysis. In particular, by adopting the online primal-dual fitting approach, we prove that our SRPT+R Algorithm in a non-multitasking cluster is $(1 + \epsilon)$-speed, $O(1/\epsilon)$-competitive. We also show that our proposed Fair+R and LAPS+R($\beta$) Algorithms for a multitasking cluster are $(4 + \epsilon)$-speed, $O(1/\epsilon)$-competitive and $(2 + 2\beta + 2\epsilon)$-speed, $O(1/(\beta\epsilon))$-competitive respectively. We demonstrate via extensive simulations that our proposed algorithms can significantly reduce job flowtime under both the non-multitasking and multitasking modes.
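The redundancy idea in the abstract (a job runs on several servers and finishes when the first copy completes) can be illustrated with a toy simulation. This is only a sketch under assumptions that are not taken from the paper: each copy's machine rate is drawn independently and uniformly from [0.5, 1.5], there is no checkpointing, and the hypothetical helper `empirical_speedup` simply reports the ratio of mean completion times. It is meant to show why completion time shrinks with more copies but with diminishing returns, not to reproduce the paper's speedup function.

```python
import random

def completion_time(job_size, copies, rng):
    """Time until the first of `copies` redundant copies finishes.

    Assumption (not from the paper): each copy runs on a machine whose
    service rate is drawn independently and uniformly from [0.5, 1.5].
    """
    rates = [rng.uniform(0.5, 1.5) for _ in range(copies)]
    # The job completes as soon as its fastest copy completes.
    return job_size / max(rates)

def empirical_speedup(copies, trials=20000, seed=0):
    """Estimate (mean time with 1 copy) / (mean time with `copies` copies)."""
    rng = random.Random(seed)
    base = sum(completion_time(1.0, 1, rng) for _ in range(trials)) / trials
    redundant = sum(completion_time(1.0, copies, rng) for _ in range(trials)) / trials
    return base / redundant

if __name__ == "__main__":
    for r in (1, 2, 4, 8):
        print(r, round(empirical_speedup(r), 3))
```

In this toy model the estimated speedup grows with the number of copies but stays bounded, because no copy can run faster than the best possible machine rate. That bounded, concave shape is the kind of behavior a scheduler has to weigh when deciding how many redundant copies a job should get.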

Related articles

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...

Comparative Analysis of Fault Tolerance Techniques in Grid Environment

Grid being a collection of heterogeneous resources connected through network, to execute complex jobs with high processing power requirements, is more vulnerable to faults. Faults may affect the performance and QoS of Grid. Faults are dealt with either avoiding them or recovering them by either re-execution or by resuming the execution from the point of failure by using the checkpoints. The var...

Analysis of checkpointing for schedulability of real-time systems

Checkpointing is a relatively cost effective method for achieving fault tolerance in real-time systems. Since checkpointing schemes depend on time redundancy, they could affect the correctness of the system by causing deadlines to be missed. This paper provides exact schedulability tests for fault tolerant task sets under specified failure hypothesis and employing checkpointing to assist in fau...

Online Scheduling of Jobs for D-benevolent instances On Identical Machines

We consider online scheduling of jobs with specific release times on m identical machines. Each job has a weight and a size; the goal is to maximize the total weight of completed jobs. At its release time, a job must immediately be scheduled on a machine or it will be rejected. It is also allowed during execution of a job to preempt it; however, it will be lost and only weight of completed jobs contri...

A Fault Tolerant Scheduling System Based on Checkpointing for Computational Grids

Job checkpointing is one of the most commonly utilized techniques for providing fault tolerance in computational grids. The efficiency of checkpointing depends on the choice of the checkpoint interval. An inappropriate checkpoint interval can delay job execution. In this paper, a fault-tolerant job scheduling system based on the checkpointing technique is presented and evaluated. When scheduling a jo...

Journal:
  • CoRR

Volume: abs/1707.01655

Publication date: 2017